# Data Processing Pipeline for CREATION_DOFILES

This repository contains the SAS programs used to process and analyze graduate data, earnings, and residency information across multiple state educational systems. The pipeline is designed to transform raw administrative records into a longitudinal dataset suitable for econometric analysis.

## 1. Repository Structure

The scripts are grouped below by role. Within the core pipeline, the numeric prefixes give the run order; Section 3 lists the full execution sequence, including the auxiliary programs that must run alongside it.

### Core Pipeline
* **`01.grad_readin.sas`**: Appends graduate data from multiple state systems (including UT System, CDHE, SUNY, CUNY, PSU, ODHE, THECB, and others) and merges them with demographic data from the Individual Characteristics File (ICF).
* **`02.earnings_merge.sas`**: Links the graduate cohort to quarterly earnings data using hash tables to aggregate earnings by state and overall. It produces an annual longitudinal file (`allearn_long_annual`).
* **`02b.earnings_merge.sas`**: A variant of the earnings merge script used for robustness checks, specifically flagging "terminal runs" of zero earnings for the sandwich estimator test mentioned in the manuscript.
* **`03.sample_restrictions.sas`**: The final processing step. It applies research-specific filters (CIP code exclusions), creates flagship institution dummies, and exports finalized datasets (`all_earnings_long.dta` and `earnings_regressions.dta`) for Stata.
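
To make the "terminal run" flag from `02b.earnings_merge.sas` concrete, here is a minimal sketch. It is written in Python purely for illustration; the repository's programs are SAS. It assumes a terminal run means the unbroken stretch of zero-earnings periods at the very end of a person's series, which is an assumption rather than the definition confirmed by the program or the manuscript. The function name is hypothetical.

```python
def flag_terminal_zero_run(earnings):
    """Flag periods that belong to the trailing run of zero earnings.

    `earnings` is one person's earnings in chronological order.
    Assumption: a "terminal run" is the unbroken block of zeros at the
    end of the series; zeros earlier in the series are not flagged.
    """
    flags = [False] * len(earnings)
    # Walk backward from the last period until a nonzero value appears.
    for i in range(len(earnings) - 1, -1, -1):
        if earnings[i] != 0:
            break
        flags[i] = True
    return flags


# The zero in position 1 is followed by positive earnings, so only the
# last two periods are flagged.
print(flag_terminal_zero_run([1200, 0, 900, 0, 0]))
# [False, False, False, True, True]
```

Flagging rather than dropping these periods keeps the base file unchanged, so the robustness check can be run with or without them.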

### Auxiliary Data Processing
* **`readin_nonemp.sas`**: A specialized program to process self-employment (non-employer) data from 2002 to 2016, converting it from a wide format to a longitudinal structure.
* **`residence_creation.sas`**: Standardizes residency definitions across various state systems (CO, NY, OH, TX, UT, MN, GA, VA, MO) by merging enrollment and demographic records.
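
The wide-to-long conversion performed by `readin_nonemp.sas` can be sketched as follows. This is a Python illustration of the reshaping idea, not the SAS code itself, and the column names (`pik`, `nonemp_2002` through `nonemp_2016`, `nonemp_earn`) are hypothetical stand-ins for whatever the real file uses.

```python
def wide_to_long(row, id_key="pik", prefix="nonemp_",
                 first_year=2002, last_year=2016):
    """Turn one wide record (one column per year) into per-year records.

    Column naming is an assumption made for this sketch. Years with no
    value in the wide record are skipped rather than emitted as missing.
    """
    records = []
    for year in range(first_year, last_year + 1):
        col = f"{prefix}{year}"
        value = row.get(col)
        if value is not None:
            records.append(
                {id_key: row[id_key], "year": year, "nonemp_earn": value}
            )
    return records


wide = {"pik": "A1", "nonemp_2002": 5000, "nonemp_2004": 7500}
for rec in wide_to_long(wide):
    print(rec)
# {'pik': 'A1', 'year': 2002, 'nonemp_earn': 5000}
# {'pik': 'A1', 'year': 2004, 'nonemp_earn': 7500}
```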

---

## 2. Requirements and Data Environment

* **Software**: SAS (Statistical Analysis System). The final step exports `.dta` files intended for Stata.
* **Input Data**:
    * **Education Records**: State-specific graduate datasets (e.g., `GRADS.grads_us_thecb`).
    * **Demographics**: Census Bureau's Individual Characteristics File (ICF).
    * **Earnings**: Administrative PHF (Person History File) records for quarterly wages.
    * **Self-Employment**: Non-employer statistics (2002–2016).

## 3. Usage and Reproducing Results

1.  **Initialize Residency Flags**: Run `residence_creation.sas` to standardize residency definitions across the state systems.
2.  **Compile Cohorts**: Execute `01.grad_readin.sas` to create the master graduate list with demographic controls.
3.  **Process Self-Employment**: Run `readin_nonemp.sas` to prepare the self-employment longitudinal file.
4.  **Merge Earnings**: Run `02.earnings_merge.sas` to link wage data to the graduates. `02b.earnings_merge.sas` is the variant used for the zero-earnings robustness check.
5.  **Apply Sample Selection**: Run `03.sample_restrictions.sas` to apply final filters, such as excluding specific degree programs (CIP codes starting with 30, 10, 46, 60, 03) and identifying flagship graduates.
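
The earnings merge in step 4 aggregates quarterly records by state and overall using hash tables. The sketch below shows the same keyed-accumulation idea with Python dictionaries standing in for SAS hash objects. It is an illustration only: the field names (`pik`, `year`, `state`, `earnings`) are assumptions, and the real program's keys and outputs may differ.

```python
from collections import defaultdict


def annual_earnings(quarterly_records):
    """Sum quarterly earnings into annual totals, by state and overall.

    Each input record is a dict with hypothetical keys:
    pik, year, state, earnings.
    Returns two dicts keyed by (pik, year, state) and (pik, year).
    """
    by_state = defaultdict(float)
    overall = defaultdict(float)
    for rec in quarterly_records:
        by_state[(rec["pik"], rec["year"], rec["state"])] += rec["earnings"]
        overall[(rec["pik"], rec["year"])] += rec["earnings"]
    return dict(by_state), dict(overall)


quarters = [
    {"pik": "A1", "year": 2005, "state": "TX", "earnings": 100},
    {"pik": "A1", "year": 2005, "state": "TX", "earnings": 200},
    {"pik": "A1", "year": 2005, "state": "CO", "earnings": 50},
]
by_state, overall = annual_earnings(quarters)
print(by_state[("A1", 2005, "TX")])  # 300.0
print(overall[("A1", 2005)])         # 350.0
```

A keyed lookup lets the totals accumulate in a single pass over the quarterly records, without first sorting by person and year.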

## 4. Key Methodology Notes

* **Flagship Dummies**: Identified via specific OPEID codes (e.g., University of Texas at Austin, University of Colorado Boulder).
* **Sample Selection**: Excludes specific degree programs (CIP codes starting with 30, 10, 46, 60, 03) and focuses on graduates from 2001 onwards.
* **Record Linkage**: The pipeline links individuals across the education and labor market datasets using the PIK (Protected Identification Key) as a consistent identifier.
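
The selection rules above can be sketched as a single filter. This Python snippet is an illustration rather than the SAS implementation in `03.sample_restrictions.sas`. It assumes CIP codes are stored as strings with leading zeros preserved (so that `03` matches) and reads "from 2001 onwards" as a graduation-year cutoff. The field names and the two OPEID values are placeholders, not real institution codes.

```python
# CIP prefixes excluded from the sample, as listed above.
EXCLUDED_CIP_PREFIXES = ("30", "10", "46", "60", "03")
MIN_GRAD_YEAR = 2001
# Placeholder identifiers only; substitute the actual flagship OPEID codes.
FLAGSHIP_OPEIDS = {"OPEID_PLACEHOLDER_A", "OPEID_PLACEHOLDER_B"}


def apply_restrictions(grads):
    """Drop excluded CIP codes and pre-2001 graduates; add a flagship dummy.

    Each record is a dict with hypothetical keys: cip, grad_year, opeid.
    """
    kept = []
    for g in grads:
        # str.startswith accepts a tuple and matches any of its prefixes.
        if g["cip"].startswith(EXCLUDED_CIP_PREFIXES):
            continue
        if g["grad_year"] < MIN_GRAD_YEAR:
            continue
        kept.append({**g, "flagship": int(g["opeid"] in FLAGSHIP_OPEIDS)})
    return kept


sample = [
    {"cip": "030101", "grad_year": 2005, "opeid": "OPEID_PLACEHOLDER_A"},
    {"cip": "520201", "grad_year": 1999, "opeid": "OPEID_PLACEHOLDER_A"},
    {"cip": "520201", "grad_year": 2005, "opeid": "OPEID_PLACEHOLDER_A"},
    {"cip": "140801", "grad_year": 2010, "opeid": "OTHER"},
]
for rec in apply_restrictions(sample):
    print(rec["cip"], rec["grad_year"], rec["flagship"])
# 520201 2005 1
# 140801 2010 0
```

If CIP codes are stored as numbers instead, the leading zero in `03` is lost and the prefix match has to be done on a zero-padded string.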